Asynchronous and Distributed Data Augmentation for Massive Data Settings


Abstract

Data augmentation (DA) algorithms are widely used for Bayesian inference due to their simplicity. In massive data settings, however, DA algorithms are prohibitively slow because they pass through the full data in every iteration, imposing serious restrictions on their usage despite their advantages. Addressing this problem, we develop a framework for extending any DA that exploits asynchronous and distributed computing. The extended DA algorithm is indexed by a parameter $r \in (0, 1]$ and is called Asynchronous and Distributed (AD) DA, with the original DA as its parent. Any ADDA starts by dividing the full data into $k$ smaller disjoint subsets and storing them on $k$ processes, which could be machines or processors. Every iteration of ADDA augments only an $r$-fraction of the $k$ data subsets with some positive probability and leaves the remaining $(1-r)$-fraction of the augmented data unchanged. The parameter draws are obtained using the mix of new and old augmented data. For many choices of $r$, the fractional updates of ADDA lead to a significant speed-up over the parent DA, and ADDA reduces to the distributed version of its parent when $r=1$. We show that the ADDA Markov chain is Harris ergodic with the desired stationary distribution under mild conditions on the parent DA algorithm. We demonstrate the numerical advantages of ADDA in three representative examples corresponding to different kinds of massive data settings encountered in applications. In all these examples, our DA generalization is significantly faster than its parent for all choices of $r$. We also establish geometric ergodicity of the ADDA Markov chain in the three examples, which in turn yields asymptotically valid standard errors for estimates of posterior quantities.
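The update pattern described in the abstract — refresh the augmented data on only an $r$-fraction of the $k$ workers, reuse stale augmented data from the rest, then draw the parameter from all $k$ blocks — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the `augment` (I-step) and `draw_params` (P-step) callables are hypothetical toy stand-ins for a concrete parent DA.

```python
import random

def adda_step(theta, subsets, cache, r, augment, draw_params, rng):
    """One ADDA iteration: refresh the augmented data on an r-fraction of
    the k workers, keep the stale augmented data on the rest, then draw
    the parameter from all k (part new, part old) augmented blocks."""
    k = len(subsets)
    n_update = max(1, round(r * k))           # size of the r-fraction
    updated = rng.sample(range(k), n_update)  # workers that respond this round
    for j in updated:
        cache[j] = augment(theta, subsets[j])  # I-step on worker j only
    return draw_params(cache), updated         # P-step uses every block

rng = random.Random(0)
subsets = [list(range(10 * j, 10 * j + 10)) for j in range(4)]  # k = 4 subsets
# Toy stand-ins (not the paper's models): the augmented block is the subset
# mean shrunk toward theta, and the parameter draw averages over blocks.
augment = lambda theta, s: 0.5 * (theta + sum(s) / len(s))
draw_params = lambda cache: sum(cache) / len(cache)

cache = [augment(0.0, s) for s in subsets]  # initial I-step on all workers
theta = draw_params(cache)
for _ in range(5):                          # r = 0.5: two workers per round
    theta, updated = adda_step(theta, subsets, cache, 0.5,
                               augment, draw_params, rng)
```

With $r = 1$ every worker refreshes its block each iteration, so the loop reduces to a distributed version of the parent DA, matching the abstract's claim.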


Similar Articles

Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...


Asynchronous Distributed Data Parallelism for Machine Learning

Distributed machine learning has gained much attention due to the recent proliferation of large-scale learning problems. Designing a high-performance framework poses many challenges and opportunities for system engineers. This paper presents a novel architecture for solving distributed learning problems in the framework of data parallelism, where model replicas are trained over multiple worker nodes. Wo...


Streaming Algorithms for Distributed, Massive Data Sets

Massive data sets are increasingly important in a wide range of applications, including observational sciences, product marketing, and monitoring and operations of large systems. In network operations, raw data typically arrive in streams, and decisions must be made by algorithms that make one pass over each stream, throw much of the raw data away, and produce "synopses" or "sketches" for furth...


A Distributed and Parallel Clustering Algorithm for Massive Biological Data

Distributed processing today is a largely advantageous technology for bridging together a system of multiple computers and processor systems in running applications. The concept of distributed processing has allowed time cutting and therefore reduction in costs. Using this, we aim to address clustering techniques in developing a new method for further reduction in time and costs. The problem of cl...


Distributed Submodular Cover: Succinctly Summarizing Massive Data

How can one find a subset, ideally as small as possible, that well represents a massive dataset? I.e., its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. In this paper, we formalize this challenge as a submodular cover problem. Here, the utility is assumed to exhibit submodularity, a natural diminishing returns condit...



Journal

Journal: Journal of Computational and Graphical Statistics

Year: 2022

ISSN: 1061-8600, 1537-2715

DOI: https://doi.org/10.1080/10618600.2022.2130928